This notebook provides a template for implementing, in stages, the functionality required to complete this project. If you need additional code that cannot be included in the notebook, be sure the Python code is successfully imported and included with your submission. Sections whose headers begin with 'Implementation' indicate where you should begin your implementation. Note that some implementation sections are optional and are marked with 'Optional' in the header.
In addition to implementing code, you must answer questions about the project and your implementation. Each section where you will answer a question is preceded by a 'Question' header. Carefully read each question and provide a thorough answer in the following text box that begins with 'Answer:'. Your project submission will be evaluated on your answers to the questions as well as the implementation you provide.
Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can be edited, typically by double-clicking the cell to enter edit mode.
Design and implement a deep learning model that learns to recognize sequences of digits. Train the model using synthetic data generated by concatenating character images from notMNIST or MNIST. To produce a synthetic sequence of digits for testing, you can, for example, limit yourself to sequences of up to five digits and use five classifiers on top of your deep network. You would have to incorporate an additional 'blank' character to account for shorter number sequences.
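The blank-padding scheme described above can be sketched in NumPy. The names `pad_sequence` and `BLANK_LABEL` below are illustrative, not part of the project template:

```python
import numpy as np

SEQ_LEN = 5          # fixed output length
BLANK_LABEL = 10     # class index reserved for the blank character
H, W = 28, 28        # per-digit image size (MNIST)

def pad_sequence(digit_images, digit_labels):
    """Concatenate up to SEQ_LEN digit images side by side, padding short
    sequences with blank (all-zero) images and the blank label."""
    n = len(digit_images)
    blanks = [np.zeros((H, W))] * (SEQ_LEN - n)
    image = np.concatenate(list(digit_images) + blanks, axis=1)  # shape (28, 140)
    labels = np.concatenate([digit_labels, [BLANK_LABEL] * (SEQ_LEN - n)])
    return image, labels

# e.g. a 3-digit sequence "4 0 7" padded to length 5
imgs = [np.ones((H, W)) * d for d in (4, 0, 7)]
image, labels = pad_sequence(imgs, [4, 0, 7])
print(image.shape, labels)  # (28, 140) [ 4  0  7 10 10]
```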
There are various aspects to consider when thinking about this problem:
Here is an example of a published baseline model on this problem. (video)
Use the code cell (or multiple code cells, if necessary) to implement the first step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.
import os
import sys
from six.moves.urllib.request import urlretrieve
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import idx2numpy
import time
from scipy.io import loadmat
t1 = time.time()
train_and_valid_dataset = idx2numpy.convert_from_file('Data/train-images-idx3-ubyte')
test_dataset = idx2numpy.convert_from_file('Data/t10k-images-idx3-ubyte')
train_and_valid_labels = idx2numpy.convert_from_file('Data/train-labels-idx1-ubyte')
test_labels = idx2numpy.convert_from_file('Data/t10k-labels-idx1-ubyte')
t2 = time.time()
print('Complete in %.2f seconds.' % (t2 - t1))
# Arrange training, validation, and testing sets
train_dataset = train_and_valid_dataset[:50000,:]
valid_dataset = train_and_valid_dataset[50000:,:]
train_labels = train_and_valid_labels[:50000]
valid_labels = train_and_valid_labels[50000:]
print('Training set size: {}'.format(train_dataset.shape))
print('Validation set size: {}'.format(valid_dataset.shape))
print('Testing set size: {}'.format(test_dataset.shape))
print('Training label size: {}'.format(train_labels.shape))
print('Validation label size: {}'.format(valid_labels.shape))
print('Testing label size: {}'.format(test_labels.shape))
def create_sets(data, labels):
    newdata = np.ndarray(shape=(data.shape[0], data.shape[1], 5 * data.shape[2]))
    newlabels = np.ndarray(shape=(data.shape[0], 5))
    blank_digit = np.zeros(shape=(28, 28))
    blank_labels = np.array([10, 10, 10, 10, 10])
    data_index = 0
    newdata_index = 0
    while newdata_index < data.shape[0]:
        # length of sequence: 1 to 5 digits inclusive (randint excludes the upper bound)
        length = np.random.randint(1, 6)
        # images of digits for the sequence
        data_digits = []
        for i in range(length):
            data_digits.append(data[data_index + i, :, :])
        # labels for the sequence, padded with the blank label (10)
        label_digits = np.empty(shape=(5))
        label_digits[0:length] = labels[data_index:data_index + length]
        label_digits[length:] = blank_labels[length:]
        # pad the sequence to five digits with blank images
        temp_digits = []
        for i in range(5):
            if i < length:
                temp_digits.append(data_digits[i])
            else:
                temp_digits.append(blank_digit)
        sequence = np.concatenate(temp_digits, axis=1)
        # append the formatted data to the new arrays
        newdata[newdata_index, :, :] = sequence
        newlabels[newdata_index] = label_digits
        # advance the source index, wrapping before the end to avoid overrunning the data
        data_index += length
        data_index %= (data.shape[0] - 5)
        newdata_index += 1
    return newdata, newlabels
train_dataset, train_labels = create_sets(train_dataset, train_labels)
valid_dataset, valid_labels = create_sets(valid_dataset, valid_labels)
test_dataset, test_labels = create_sets(test_dataset, test_labels)
num_channels = 1 # greyscale
image_size = 28
num_labels = 11
seq_length = 5
def reformat_conv(dataset, labels):
    dataset = dataset.reshape(
        (-1, image_size, seq_length * image_size, num_channels)).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat_conv(train_dataset, train_labels)
valid_dataset, valid_labels = reformat_conv(valid_dataset, valid_labels)
test_dataset, test_labels = reformat_conv(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
# TRIM VALID AND TEST SETS TO AVOID CRASHING JUPYTER
valid_dataset = valid_dataset[:2000]
valid_labels = valid_labels[:2000]
test_dataset = test_dataset[:2000]
test_labels = test_labels[:2000]
def elementwise_accuracy(predictions, labels):
    # percentage of individual digit positions predicted correctly
    preds = np.transpose(np.argmax(predictions, 2))
    return 100 * np.sum(preds == labels) / len(preds) / seq_length

def accuracy(predictions, labels):
    # percentage of sequences where all five positions are correct
    total = 0
    preds = np.transpose(np.argmax(predictions, 2))
    for i in range(len(preds)):
        if sum(preds[i] == labels[i]) == seq_length:
            total += 1
    return 100 * total / len(preds)
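As a quick sanity check, the two metrics can be exercised on a toy batch of two sequences. The helper bodies below mirror the definitions above, and predictions are assumed to have shape `(seq_length, batch, num_labels)` as produced by the model:

```python
import numpy as np

seq_length = 5

def elementwise_accuracy(predictions, labels):
    # percentage of individual digit positions predicted correctly
    preds = np.transpose(np.argmax(predictions, 2))
    return 100 * np.sum(preds == labels) / len(preds) / seq_length

def accuracy(predictions, labels):
    # percentage of sequences where all five positions are correct
    total = 0
    preds = np.transpose(np.argmax(predictions, 2))
    for i in range(len(preds)):
        if sum(preds[i] == labels[i]) == seq_length:
            total += 1
    return 100 * total / len(preds)

# two sequences: the first fully correct, the second wrong in one position
labels = np.array([[1, 2, 3, 10, 10],
                   [4, 5, 6, 10, 10]])
pred_classes = np.array([[1, 2, 3, 10, 10],
                         [4, 5, 9, 10, 10]])   # 9 instead of 6
# build one-hot-style logits whose argmax reproduces pred_classes
logits = np.zeros((seq_length, 2, 11))
for pos in range(seq_length):
    for b in range(2):
        logits[pos, b, pred_classes[b, pos]] = 1.0

print(elementwise_accuracy(logits, labels))  # 90.0: 9 of 10 digits correct
print(accuracy(logits, labels))              # 50.0: 1 of 2 sequences fully correct
```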
batch_size = 32
patch_size = 5
depth1 = 32
depth2 = 64
num_hidden = 128
beta = 8e-3
graph = tf.Graph()
with graph.as_default():
    # Input data.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size, image_size * seq_length, num_channels))
    tf_train_labels = tf.placeholder(tf.int32, shape=(batch_size, seq_length))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables
    layer1_weights = tf.get_variable("ConvW1", shape=[patch_size, patch_size,
        num_channels, depth1], initializer=tf.contrib.layers.xavier_initializer())
    layer1_bias = tf.Variable(tf.zeros([depth1]), name="ConvB1")
    layer2_weights = tf.get_variable("ConvW2", shape=[patch_size, patch_size,
        depth1, depth2], initializer=tf.contrib.layers.xavier_initializer())
    layer2_bias = tf.Variable(tf.zeros([depth2]), name="ConvB2")
    layer3_weights = tf.get_variable("FcW1", shape=[image_size // 4 * image_size // 4 * depth2,
        num_hidden], initializer=tf.contrib.layers.xavier_initializer())
    layer3_bias = tf.Variable(tf.zeros([num_hidden]), name="FcB1")
    layer4_weights = tf.get_variable("ClfW", shape=[num_hidden, num_labels],
        initializer=tf.contrib.layers.xavier_initializer())
    layer4_bias = tf.Variable(tf.zeros([num_labels]), name="ClfB")

    # Model.
    def model(data, dropout=False):
        # split the sequence image into five digit-sized slices; each slice
        # passes through the same (shared) weights
        split_data = tf.split(data, seq_length, axis=2)
        digits = []
        for digit in split_data:
            # First convolutional layer
            conv1 = tf.nn.conv2d(digit, layer1_weights, [1, 1, 1, 1], padding='SAME')
            hidden1 = tf.nn.relu(conv1 + layer1_bias)
            pool1 = tf.nn.max_pool(hidden1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
            # Second convolutional layer
            conv2 = tf.nn.conv2d(pool1, layer2_weights, [1, 1, 1, 1], padding='SAME')
            hidden2 = tf.nn.relu(conv2 + layer2_bias)
            pool2 = tf.nn.max_pool(hidden2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
            # Flatten the tensor
            shape = pool2.get_shape().as_list()
            reshape = tf.reshape(pool2, [shape[0], shape[1] * shape[2] * shape[3]])
            # Dropout (training only)
            if dropout:
                reshape = tf.nn.dropout(reshape, keep_prob=0.6)
            hidden3 = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_bias)
            if dropout:
                hidden3 = tf.nn.dropout(hidden3, keep_prob=0.6)
            digits.append(tf.matmul(hidden3, layer4_weights) + layer4_bias)
        return digits

    # Training computation.
    train_logits = model(tf_train_dataset, dropout=True)
    loss = 0
    for i in range(seq_length):
        loss += tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=tf_train_labels[:, i], logits=train_logits[i]))
    loss += beta * (tf.nn.l2_loss(layer1_weights) + tf.nn.l2_loss(layer2_weights)
                    + tf.nn.l2_loss(layer3_weights) + tf.nn.l2_loss(layer4_weights))

    # Optimizer: pass global_step to minimize() so the decay schedule advances.
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(0.01, global_step,
        decay_steps=500, decay_rate=0.95, staircase=True)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    valid_logits = model(tf_valid_dataset)
    test_logits = model(tf_test_dataset)
    for i in range(seq_length):
        valid_logits[i] = tf.nn.softmax(valid_logits[i])
        test_logits[i] = tf.nn.softmax(test_logits[i])
    train_prediction = tf.stack(train_logits)
    valid_prediction = tf.stack(valid_logits)
    test_prediction = tf.stack(test_logits)
    saver = tf.train.Saver()
num_steps = 10001
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    # saver.restore(session, tf.train.latest_checkpoint('./'))
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if step % 1000 == 0:
            print('Minibatch loss at step %d: %f' % (step, l))
            print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
            print('Validation accuracy: %.1f%%' % accuracy(
                valid_prediction.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
    save_path = saver.save(session, "./tensorflowcheckpoints/sess")
    print('Model saved to {}'.format(save_path))
What approach did you take in coming up with a solution to this problem?
Answer:
After loading the training and testing data, I created sequences of between 1 and 5 digits. These sequences are fed into a two-layer convolutional network; I chose a convolutional network because it is an effective tool for processing images.
The model splits each sequence into five separate digit images and applies the convolutional and fully connected layers to each one independently, with shared weights, producing 11 output classes per position: the digits 0-9 plus a blank character, encoded as 10.
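The per-digit split can be illustrated in NumPy on a single 28x140 sequence image. The model itself does this with `tf.split` on 4-D batches; this 2-D sketch is only for intuition:

```python
import numpy as np

image_size, seq_length = 28, 5
# a stand-in for one (28, 140) sequence image
sequence = np.arange(image_size * image_size * seq_length).reshape(
    image_size, image_size * seq_length)

# split the (28, 140) sequence into five (28, 28) digit slices,
# mirroring tf.split(data, seq_length, axis=2) in the model
digit_slices = np.split(sequence, seq_length, axis=1)
print(len(digit_slices), digit_slices[0].shape)  # 5 (28, 28)
```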
What does your final architecture look like? (Type of model, layers, sizes, connectivity, etc.)
Answer:
The final architecture has two 5x5 convolutional layers, each followed by 2x2 max pooling with stride 2, then a fully connected layer with dropout. L2 regularization is also applied to the weights.
The optimizer is Adam with an exponentially decaying learning rate. I chose it because it worked better than gradient descent and the momentum optimizer.
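The staircase decay schedule used in the graph reduces to a simple closed form, `lr = 0.01 * 0.95 ** (step // 500)`; the helper below is only an illustration of that formula, not part of the model:

```python
def decayed_lr(step, base=0.01, rate=0.95, decay_steps=500):
    # staircase exponential decay: the rate drops once per decay_steps steps
    return base * rate ** (step // decay_steps)

for step in (0, 500, 5000, 10000):
    print(step, decayed_lr(step))
```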
How did you train your model? How did you generate your synthetic dataset? Include examples of images from the synthetic data you constructed.
Answer:
I trained the model on sequences of variable length. The synthetic dataset is generated by the create_sets function, which selects a number between 1 and 5, and appends digits and labels from the raw dataset to numpy arrays. Some examples of training data and labels can be seen below.
def plotnum(data):
    plt.imshow(data)
    plt.show()

for _ in range(3):
    i = np.random.randint(100, 200)
    plotnum(np.squeeze(train_dataset[i]))
    print(train_labels[i].astype(np.int32))
Once you have settled on a good architecture, you can train your model on real data. In particular, the Street View House Numbers (SVHN) dataset is a good large-scale dataset collected from house numbers in Google Street View. Training on this more challenging dataset, where the digits are not neatly lined-up and have various skews, fonts and colors, likely means you have to do some hyperparameter exploration to perform well.
Use the code cell (or multiple code cells, if necessary) to implement this step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.
import os
import sys
from six.moves.urllib.request import urlretrieve
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import idx2numpy
import time
from scipy.io import loadmat
train_data = loadmat('cropped_data/train_32x32.mat')
test_data = loadmat('cropped_data/test_32x32.mat')
# convert the label for zero from 10 to 0 in both the training and test sets
ten_to_zero = np.vectorize(lambda x: 0 if x == 10 else x)
train_data['y'] = ten_to_zero(train_data['y'])
test_data['y'] = ten_to_zero(test_data['y'])
train_dataset = train_data['X']
test_dataset = test_data['X']
train_labels = train_data['y']
test_labels = test_data['y']
# switch image index to first index
train_dataset = np.rollaxis(train_dataset, 3)
test_dataset = np.rollaxis(test_dataset, 3)
# Rearrange to [-1,1)
train_dataset = (train_dataset / 128.0) - 1.0
test_dataset = (test_dataset / 128.0) - 1.0
# Create validation set
valid_dataset = train_dataset[0:5000, :, :, :]
valid_labels = train_labels[0:5000, :]
train_dataset = train_dataset[5000:, :, :, :]
train_labels = train_labels[5000:, :]
# Shrink Test Dataset because jupyter keeps crashing
valid_dataset = valid_dataset[:2000]
valid_labels = valid_labels[:2000]
test_dataset = test_dataset[:2000]
test_labels = test_labels[:2000]
print(train_dataset.shape, train_labels.shape)
print(valid_dataset.shape, valid_labels.shape)
print(test_dataset.shape, test_labels.shape)
image_size = 32
num_channels = 3
num_labels = 10
def reformat_conv(dataset, labels):
    dataset = dataset.reshape(
        (-1, image_size, image_size, num_channels)).astype(np.float32)
    # one-hot encode: labels has shape (N, 1), so broadcasting gives (N, num_labels)
    labels = (np.arange(num_labels) == labels).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat_conv(train_dataset, train_labels)
valid_dataset, valid_labels = reformat_conv(valid_dataset, valid_labels)
test_dataset, test_labels = reformat_conv(test_dataset, test_labels)
print(train_dataset.shape, train_labels.shape)
print(valid_dataset.shape, valid_labels.shape)
print(test_dataset.shape, test_labels.shape)
def accuracy(preds, labels):
    # percentage of images whose predicted class matches the one-hot label
    return 100.0 * np.sum(np.argmax(preds, 1) == np.argmax(labels, 1)) / preds.shape[0]
image_size = 32
num_channels = 3
depth1 = 16
depth2 = 32
depth3 = 64
depth4 = 128
kp_fc = 0.5
kp_conv = 0.9
depth5 = 256
num_classes = 10
beta=5e-5
batch_size = 20
patch_size = 4
num_labels = 10
graph = tf.Graph()
with graph.as_default():
    # Input
    tf_train_dataset = tf.placeholder(tf.float32, shape=(None, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(None, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.cast(tf.constant(test_dataset), tf.float32)

    # Variables
    weight_layer1 = tf.get_variable("ConvW1", shape=[patch_size, patch_size,
        num_channels, depth1], initializer=tf.contrib.layers.xavier_initializer())
    bias_layer1 = tf.Variable(tf.constant(1.0, shape=[depth1]))
    weight_layer2 = tf.get_variable("ConvW2", shape=[patch_size, patch_size,
        depth1, depth2], initializer=tf.contrib.layers.xavier_initializer())
    bias_layer2 = tf.Variable(tf.constant(1.0, shape=[depth2]))
    weight_layer3 = tf.get_variable("ConvW3", shape=[patch_size, patch_size,
        depth2, depth3], initializer=tf.contrib.layers.xavier_initializer())
    bias_layer3 = tf.Variable(tf.constant(1.0, shape=[depth3]))
    weight_layer4 = tf.get_variable("ConvW4", shape=[patch_size, patch_size,
        depth3, depth4], initializer=tf.contrib.layers.xavier_initializer())
    bias_layer4 = tf.Variable(tf.constant(1.0, shape=[depth4]))
    weight_layer5 = tf.get_variable("FcW1", shape=[2 * 2 * depth4, depth5],
        initializer=tf.contrib.layers.xavier_initializer())
    bias_layer5 = tf.Variable(tf.constant(1.0, shape=[depth5]))
    weight_layer6 = tf.get_variable("FcW2", shape=[depth5, num_classes],
        initializer=tf.contrib.layers.xavier_initializer())
    bias_layer6 = tf.Variable(tf.constant(1.0, shape=[num_classes]))

    # Model
    def model(data, dropout=False):
        # Convolution 1
        conv_1 = tf.nn.conv2d(data, weight_layer1, [1, 1, 1, 1], padding='SAME')
        hidden_1 = tf.nn.relu(conv_1 + bias_layer1)
        pool_1 = tf.nn.max_pool(hidden_1, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        if dropout:
            pool_1 = tf.nn.dropout(pool_1, keep_prob=kp_conv)
        # Convolution 2
        conv_2 = tf.nn.conv2d(pool_1, weight_layer2, [1, 1, 1, 1], padding='SAME')
        hidden_2 = tf.nn.relu(conv_2 + bias_layer2)
        pool_2 = tf.nn.max_pool(hidden_2, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        if dropout:
            pool_2 = tf.nn.dropout(pool_2, keep_prob=kp_conv)
        # Convolution 3
        conv_3 = tf.nn.conv2d(pool_2, weight_layer3, [1, 1, 1, 1], padding='SAME')
        hidden_3 = tf.nn.relu(conv_3 + bias_layer3)
        pool_3 = tf.nn.max_pool(hidden_3, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        if dropout:
            pool_3 = tf.nn.dropout(pool_3, keep_prob=kp_conv)
        # Convolution 4
        conv_4 = tf.nn.conv2d(pool_3, weight_layer4, [1, 1, 1, 1], padding='SAME')
        hidden_4 = tf.nn.relu(conv_4 + bias_layer4)
        pool_4 = tf.nn.max_pool(hidden_4, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
        # Flatten to (batch, 2 * 2 * depth4)
        shape = pool_4.get_shape().as_list()
        reshape = tf.reshape(pool_4, [-1, shape[1] * shape[2] * shape[3]])
        if dropout:
            reshape = tf.nn.dropout(reshape, keep_prob=kp_fc)
        # Fully connected
        fc = tf.nn.relu(tf.matmul(reshape, weight_layer5) + bias_layer5)
        # Output classes
        return tf.matmul(fc, weight_layer6) + bias_layer6

    # Training computation.
    logits = model(tf_train_dataset, dropout=True)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        labels=tf_train_labels, logits=logits))
    loss += beta * (tf.nn.l2_loss(weight_layer1) + tf.nn.l2_loss(weight_layer2)
                    + tf.nn.l2_loss(weight_layer3) + tf.nn.l2_loss(weight_layer4)
                    + tf.nn.l2_loss(weight_layer5) + tf.nn.l2_loss(weight_layer6))

    # Optimizer: pass global_step to minimize() so the decay schedule advances.
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(0.001, global_step,
        decay_steps=10000, decay_rate=0.95, staircase=True)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    valid_prediction = tf.nn.softmax(model(tf_valid_dataset))
    test_prediction = tf.nn.softmax(model(tf_test_dataset))
    saver = tf.train.Saver()
num_steps = 200001
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if step % 500 == 0:
            print('Minibatch loss at step %d: %f' % (step, l))
            print('Minibatch accuracy: %.1f%%' % accuracy(predictions, batch_labels))
        if step % 2000 == 0:
            print('Validation accuracy: %.1f%%' % accuracy(
                valid_prediction.eval(), valid_labels))
    print('Test accuracy: %.1f%%' % accuracy(test_prediction.eval(), test_labels))
    save_path = saver.save(session, "./tensorflowcheckpoints/svhn")
    print('Model saved to {}'.format(save_path))
Describe how you set up the training and testing data for your model. How does the model perform on a realistic dataset?
Answer:
The training and testing data are the cropped SVHN digits. Because of the variety in the shapes and colours of the digits, the accuracy of the trained network is only about 92%, lower than the 99%+ achievable on MNIST.
What changes did you have to make, if any, to achieve "good" results? Were there any options you explored that made the results worse?
Answer:
I tried several different optimizers, learning rates, and regularization configurations before settling on the current model. The momentum optimizer did not work well, nor did high learning rates or a high beta for L2 regularization.
This model is quite different from the one used for MNIST sequences. First, it works on single cropped digits only. It is also much deeper, using four convolutional layers with max pooling and dropout (keep probability 0.9), followed by a fully connected layer with dropout (keep probability 0.5).
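The `2 * 2 * depth4` input size of the first fully connected layer follows from four stride-2 poolings of the 32x32 input; a quick sketch of the arithmetic:

```python
image_size = 32
depth4 = 128

size = image_size
for _ in range(4):          # four 2x2 max-pool layers, each with stride 2
    size = (size + 1) // 2  # 'SAME' padding rounds the output size up
print(size)                 # 32 -> 16 -> 8 -> 4 -> 2

fc_input = size * size * depth4
print(fc_input)             # matches the FcW1 shape [2 * 2 * depth4, depth5]
```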
What were your initial and final results with testing on a realistic dataset? Do you believe your model is doing a good enough job at classifying numbers correctly?
Answer:
Many models I tested would not score above 20.4% on the validation set. Even this final model did not improve for the first 60,000 steps before it suddenly started learning.
The benchmark in the Google paper was 98%, chosen as a threshold comparable to human accuracy. My model achieves only 92%, so it would not yet be practical in a real-world setting.
Take several pictures of numbers that you find around you (at least five), and run them through your classifier on your computer to produce example results. Alternatively (optionally), you can try using OpenCV / SimpleCV / Pygame to capture live images from a webcam and run those through your classifier.
Use the code cell (or multiple code cells, if necessary) to implement this step of your project. Once you have completed your implementation and are satisfied with the results, be sure to thoroughly answer the questions that follow.
from IPython.display import Image as iImage, display

picfolder_path = "./Pics"
pic_paths = os.listdir(picfolder_path)
for i in range(5):
    # display the pictures
    display(iImage(os.path.join(picfolder_path, pic_paths[i]), width=100, height=100))
labels = np.array([7, 2, 5, 5, 9])